class: center, middle, inverse, title-slide # ECON 3818 ## Chapter 15 ### Kyle Butts ### 26 July 2021 --- exclude: true --- class: clear, middle <!-- Custom css --> <style type="text/css"> @import url(https://fonts.googleapis.com/css?family=Zilla+Slab:300,300i,400,400i,500,500i,700,700i); /* Create a highlighted class called 'hi' */ .hi { font-weight: 600; } .bw { background-color: rgb(0, 0, 0); color: #ffffff; } .gw { background-color: #d2d2d2; color: #ffffff; } /* Font styling */ .mono { font-family: monospace; } .ul { text-decoration: underline; } .ol { text-decoration: overline; } .st { text-decoration: line-through; } .bf { font-weight: bold; } .it { font-style: italic; } /* Font Sizes */ .bigger { font-size: 125%; } .huge{ font-size: 150%; } .small { font-size: 95%; } .smaller { font-size: 85%; } .smallest { font-size: 75%; } .tiny { font-size: 50%; } /* Remark customization */ .clear .remark-slide-number { display: none; } .inverse .remark-slide-number { display: none; } .remark-code-line-highlighted { background-color: rgba(249, 39, 114, 0.5); } .remark-slide-content { background-color: #ffffff; font-size: 24px; /* font-weight: 300; */ /* line-height: 1.5; */ /* padding: 1em 2em 1em 2em; */ } /* Xaringan tweeks */ .inverse { background-color: #23373B; text-shadow: 0 0 20px #333; /* text-shadow: none; */ } .title-slide { background-color: #ffffff; border-top: 80px solid #ffffff; } .footnote { bottom: 1em; font-size: 80%; color: #7f7f7f; } /* Mono-spaced font, smaller */ .mono-small { font-family: monospace; font-size: 20px; } .mono-small .mjx-chtml { font-size: 103% !important; } .pseudocode, .pseudocode-small { font-family: monospace; background: #f8f8f8; border-radius: 3px; padding: 10px; padding-top: 0px; padding-bottom: 0px; } .pseudocode-small { font-size: 20px; } .super{ vertical-align: super; font-size: 70%; line-height: 1%; } .sub{ vertical-align: sub; font-size: 70%; line-height: 1%; } .remark-code { font-size: 68%; } .inverse > h2 { color: #e64173; 
font-weight: 300; font-size: 40px; font-style: italic; margin-top: -25px; } .title-slide > h2 { margin-top: -25px; padding-bottom: -20px; color: rgba(249, 38, 114, 0.75); text-shadow: none; font-weight: 300; font-size: 35px; font-style: normal; text-align: left; margin-left: 15px; } .remark-inline-code { background: #F5F5F5; /* lighter */ /* background: #e7e8e2; /* darker */ border-radius: 3px; padding: 4px; } /* 2/3 left; 1/3 right */ .more-left { float: left; width: 63%; } .less-right { float: right; width: 31%; } .more-right ~ * { clear: both; } /* 9/10 left; 1/10 right */ .left90 { padding-top: 0.7em; float: left; width: 85%; } .right10 { padding-top: 0.7em; float: right; width: 9%; } /* 95% left; 5% right */ .left95 { padding-top: 0.7em; float: left; width: 91%; } .right05 { padding-top: 0.7em; float: right; width: 5%; } .left5 { padding-top: 0.7em; margin-left: 0em; margin-right: -0.4em; float: left; width: 7%; } .left10 { padding-top: 0.7em; margin-left: -0.2em; margin-right: -0.5em; float: left; width: 10%; } .left30 { padding-top: 0.7em; float: left; width: 30%; } .right30 { padding-top: 0.7em; float: right; width: 30%; } .thin-left { padding-top: 0.7em; margin-left: -1em; margin-right: -0.5em; float: left; width: 27.5%; } /* Example */ .ex { font-weight: 300; color: #cccccc !important; font-style: italic; } .col-left { float: left; width: 47%; margin-top: -1em; } .col-right { float: right; width: 47%; margin-top: -1em; } .clear-up { clear: both; margin-top: -1em; } /* Format tables */ table { color: #000000; font-size: 14pt; line-height: 100%; border-top: 1px solid #ffffff !important; border-bottom: 1px solid #ffffff !important; } th, td { background-color: #ffffff; } table th { font-weight: 400; } /* Extra left padding */ .pad-left { margin-left: 5%; } /* Extra left padding */ .big-left { margin-left: 15%; margin-bottom: -0.4em; } /* Attention */ .attn { font-weight: 500; color: #e64173 !important; font-family: 'Zilla Slab' !important; } /* Note */ .note 
{ font-weight: 300; font-style: italic; color: #314f4f !important; /* color: #cccccc !important; */ font-family: 'Zilla Slab' !important; } /* Question and answer */ .qa { font-weight: 500; /* color: #314f4f !important; */ color: #e64173 !important; font-family: 'Zilla Slab' !important; } /* Remove orange line */ hr, .title-slide h2::after, .mline h1::after { content: ''; display: block; border: none; background-color: #e5e5e5; color: #e5e5e5; height: 1px; } </style> <!-- From xaringancolor --> <div style = "position:fixed; visibility: hidden"> `\(\require{color}\definecolor{red_pink}{rgb}{0.901960784313726, 0.254901960784314, 0.450980392156863}\)` `\(\require{color}\definecolor{turquoise}{rgb}{0.125490196078431, 0.698039215686274, 0.666666666666667}\)` `\(\require{color}\definecolor{orange}{rgb}{1, 0.647058823529412, 0}\)` `\(\require{color}\definecolor{red}{rgb}{0.984313725490196, 0.380392156862745, 0.0274509803921569}\)` `\(\require{color}\definecolor{blue}{rgb}{0.231372549019608, 0.231372549019608, 0.603921568627451}\)` `\(\require{color}\definecolor{green}{rgb}{0.545098039215686, 0.694117647058824, 0.454901960784314}\)` `\(\require{color}\definecolor{grey_light}{rgb}{0.701960784313725, 0.701960784313725, 0.701960784313725}\)` `\(\require{color}\definecolor{grey_mid}{rgb}{0.498039215686275, 0.498039215686275, 0.498039215686275}\)` `\(\require{color}\definecolor{grey_dark}{rgb}{0.2, 0.2, 0.2}\)` `\(\require{color}\definecolor{purple}{rgb}{0.415686274509804, 0.352941176470588, 0.803921568627451}\)` `\(\require{color}\definecolor{slate}{rgb}{0.192156862745098, 0.309803921568627, 0.309803921568627}\)` </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { red_pink: ["{\color{red_pink}{#1}}", 1], turquoise: ["{\color{turquoise}{#1}}", 1], orange: ["{\color{orange}{#1}}", 1], red: ["{\color{red}{#1}}", 1], blue: ["{\color{blue}{#1}}", 1], green: ["{\color{green}{#1}}", 1], grey_light: ["{\color{grey_light}{#1}}", 1], grey_mid: 
["{\color{grey_mid}{#1}}", 1], grey_dark: ["{\color{grey_dark}{#1}}", 1], purple: ["{\color{purple}{#1}}", 1], slate: ["{\color{slate}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .red_pink {color: #E64173;} .turquoise {color: #20B2AA;} .orange {color: #FFA500;} .red {color: #FB6107;} .blue {color: #3B3B9A;} .green {color: #8BB174;} .grey_light {color: #B3B3B3;} .grey_mid {color: #7F7F7F;} .grey_dark {color: #333333;} .purple {color: #6A5ACD;} .slate {color: #314F4F;} </style> ## Chapter 15: Parameters and Statistics --- # Parameters and Statistics We have discussed using sample data to make inference about the population. In particular, we will use sample .hi.green[statistics] to make inference about population .hi.purple[parameters]. A .hi.purple[parameter] is a number that describes the population. In practice, parameters are unknown because we cannot examine the entire population. A .hi.green[statistic] is a number that can be calculated from sample data without using any unknown parameters. In practice, we use statistics to estimate parameters. --- # Greek Letters and Statistics .pull-left[ .hi.green[Latin Letters] - Latin letters like `\(\bar{x}\)` and `\(s^2\)` are calculations that represent guesses (estimates) at the population values. ] .pull-right[ .hi.purple[Greek Letters] - Greek letters like `\(\mu\)` and `\(\sigma^2\)` represent the truth about the population. 
] The goal for the class is for the Latin letters to be good guesses for the Greek letters: $$ {\color{green}\text{Data}} \longrightarrow {\color{green}\text{Calculation}} \longrightarrow {\color{green}\text{Estimates}} \longrightarrow^{hopefully!} {\color{purple}\text{Truth}} $$ For example, $$ {\color{green}X} \longrightarrow {\color{green} \frac{1}{n} \sum_{i=1}^n X_i} \longrightarrow {\color{green}\bar{x}} \longrightarrow^{hopefully!} {\color{purple}\mu} $$ --- # Examples of Parameters Some parameters of distributions we've encountered are - `\(n\)` and `\(p\)` in `\(X\sim B(n,p)\)` with probability mass function $$ P(X=x)={n \choose x} p^x \left(1-p\right)^{n-x} $$ - `\(a\)` and `\(b\)` in `\(X\sim U(a,b)\)` with probability density function $$ f(x)=\frac{1}{b-a}, \quad a \leq x \leq b $$ - `\(\mu\)` and `\(\sigma^2\)` in `\(X\sim N(\mu,\sigma^2)\)` with probability density function $$ f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $$ --- # Mean and Variance Two population parameters of particular interest are - the mean, denoted `\(\mu\)`, defined by `\(E(X)\)` - the variance, denoted `\(\sigma^2\)`, defined by `\(E(X^2)-E(X)^2\)` We .hi[do not] observe these. Therefore, we guess using - the sample mean, `\(\bar{X}\)` - the sample variance, `\(s^2\)` Why do we use these as our guesses? --- # Getting the right sample Before we talk about the properties of sample statistics, we need to make sure we have the right sample. We have talked about good ways to generate a sample. .hi.it[The right sample is the most important part of any data analysis.] A .hi.green[Simple Random Sample] has no bias, and all of its observations come from the same population. --- # Identically Distributed If every observation is from the same population, we say all of the observations in our sample are .hi.red_pink[identically distributed]. 
In math, this means for any two observations `\(X_i\)` and `\(X_j\)`, $$ Pr(X_i < x) = Pr(X_j < x) $$ --- # Independent Observations Does observing `\(X_i\)` impact our best guess of `\(X_j\)`? Sometimes yes (time series, spatial dependence), but hopefully not. To simplify things, we need to assume .hi.red[independent sample observations], meaning $$ Pr(X_i=a \ \vert \ X_j=b) = Pr(X_i=a) $$ Intuitively, this means that .it[observing] one outcome doesn't help you .it[predict] any other outcome. To summarize, we want an .it[i.i.d.] sample, i.e. sample observations that are .hi.purple[independent and identically distributed]. --- # Sample Statistics are Random Variables For a sample `\(X_1,..., X_n\)` of the random variable `\(X\)`, any function of that sample, `\(\hat{\theta}=g(X_1,...,X_n)\)`, is a .hi.turquoise[sample statistic]. For example, `$${\color{turquoise}\bar{X}} = \frac{1}{n} \sum_{i=1}^{n} X_i$$` `$${\color{turquoise}\displaystyle s^2} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$` Because `\(X_1,..., X_n\)` are random variables, any sample statistic `\({\color{turquoise} \hat{\theta}} = g(X_1,...,X_n)\)` is itself a random variable! That means there is some distribution for the values of `\({\color{turquoise} \hat{\theta}}\)`. --- # Sampling Distributions This is one of the most important concepts in the course. One .hi[trial] would consist of the following: - .hi.green[Random Sample] - Grab a group of observations from the population - .hi.turquoise[Sample Statistic] - Take your particular random sample and calculate a sample statistic (e.g. the sample mean) .hi.orange[Sampling Distribution] - Imagine repeatedly grabbing a different group of observations from the population and calculating the sample mean each time. This is performing many .hi[trials]. The sample means themselves will have a distribution. 
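The .hi[trial] procedure above is easy to simulate. A minimal sketch (in Python, with a made-up population of exam scores whose true mean is 75; all numbers are illustrative):

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 100,000 exam scores with true mean mu = 75
population = [random.gauss(75, 10) for _ in range(100_000)]

def one_trial(n):
    """One trial: grab a random sample of size n, compute its sample mean."""
    sample = random.sample(population, n)
    return statistics.mean(sample)

# Many trials -> the sampling distribution of the sample mean
sample_means = [one_trial(25) for _ in range(2_000)]

print(statistics.mean(sample_means))  # clusters around the true mean, 75
```

Plotting a histogram of `sample_means` reproduces the pictures on the next slides: the trial means pile up around the population mean.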
--- class: clear <img src="data:image/png;base64,#frame1.png" width="100%" style="display: block; margin: auto;" /> --- class: clear <img src="data:image/png;base64,#frame2.png" width="100%" style="display: block; margin: auto;" /> --- class: clear .center[ <img style="width:100%;" src="data:image/png;base64,#sample_dist.gif"/> ] --- class: clear <img src="data:image/png;base64,#frame1000.png" width="100%" style="display: block; margin: auto;" /> --- # Sample Size The variance of the .it.orange[sampling distribution] depends on the sample size. As `\(n\)` gets larger, each individual .hi[trial] gives a better guess at the mean. Hence, the .orange[sampling distribution] gets narrower. .center[ <img style="width:80%;" src="data:image/png;base64,#dist_n.gif"/> ] --- # Sampling Distributions <img src="data:image/png;base64,#sample_dist_diff_n.png" width="100%" style="display: block; margin: auto;" /> --- # Sampling Distributions In practice, though, we only ever observe one sample. How does the concept of the .orange[sampling distribution] help us? -- - Since we don't know the true population parameter, our .turquoise[sample statistic] is our best guess at its value. - If we know the .orange[sampling distribution], then we can quantify the uncertainty around our .turquoise[sample statistic]. --- # Law of Large Numbers Is `\(\bar{X}\)` actually a good guess for `\(\mu\)`? Under certain conditions, we can use the .hi.purple[Law of Large Numbers (LLN)] to guarantee that `\(\bar{X}\)` approaches `\(\mu\)` as the sample size grows large. -- .hi[Theorem]: Let `\(X_1,X_2,...,X_n\)` be an i.i.d. set of observations with `\(E(X_i) = \mu\)`. Define the sample mean of size `\(n\)` as `\(\bar{X}_n = \frac{1}{n}\sum_{i = 1}^{n}X_i\)`. Then $$ \bar{X}_n \to \mu \quad \text{as} \quad n \to \infty. $$ Intuitively, as we observe a larger and larger sample, we average over randomness and our sample mean approaches the true population mean. 
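The LLN is easy to check by simulation. A minimal sketch (in Python; a fair six-sided die, so the true mean is `\(\mu = 3.5\)`):

```python
import random

random.seed(1)

mu = 3.5  # true mean of a fair six-sided die

# Track the running sample mean as the sample grows
total = 0.0
running_mean = {}
for n in range(1, 100_001):
    total += random.randint(1, 6)  # one more die roll
    if n in (10, 1_000, 100_000):
        running_mean[n] = total / n

# The running means drift toward mu = 3.5 as n grows
print(running_mean)
```

This is exactly the convergence the animation on the next slide shows.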
--- # Law of Large Numbers .center[ <img style="width:80%;" src="data:image/png;base64,#lln.gif"/> ] --- # Law of Large Numbers <img src="data:image/png;base64,#sample_dist_diff_n.png" width="100%" style="display: block; margin: auto;" /> --- # Properties of the sample mean .hi[Theorem]: Let `\(X_1,X_2,...,X_n\)` be an i.i.d. sample with `\(E(X_i) = \mu\)` and `\(Var(X_i) = \sigma^2 < \infty\)`. Then `$$E(\bar{X}_n) = \mu$$` `$$Var(\bar{X}_n) = \frac{\sigma^2}{n}$$` Intuitively, if we grab many samples from a population, the variance of the resulting sample averages shrinks as each sample contains more observations. --- # Clicker Question Suppose we sample 100 observations from a distribution with `\(\mu = 15\)` and `\(\sigma^2 = 25\)`. What are `\(E(\bar{X}_{100})\)` and `\(Var(\bar{X}_{100})\)`? <ol type = "a"> <li>\(E(\bar{X}_{100}) = 15\), \(Var(\bar{X}_{100}) = 25\) <li>\(E(\bar{X}_{100}) = 0.15\), \(Var(\bar{X}_{100}) = 0.25\) <li>\(E(\bar{X}_{100}) = 15\), \(Var(\bar{X}_{100}) = 5\) <li>\(E(\bar{X}_{100}) = 15\), \(Var(\bar{X}_{100}) = 0.25\) </ol> --- class: clear ## When is the sample mean Normally Distributed? Although we know the mean and variance of `\(\bar{X}\)`, we generally don't know its distribution function. .hi[Theorem]: Let `\(X_1,X_2,...,X_n\)` be an i.i.d. sample with `\(X_i \sim N(\mu, \sigma^2)\)` for `\(i=1,2,...,n\)`. Then $$ \bar{X}_n \sim N(\mu, \frac{\sigma^2}{n}). $$ Intuitively, if all the observations come from the same normal distribution, then the sample average is also normally distributed and centered at the true mean (but much narrower). --- # Central Limit Theorem What if the `\(X_i\)` are not normally distributed? If the number of observations, `\(n\)`, per sample is large (we will discuss how large later), then the shape of the distribution of `\(X_i\)` doesn't matter. We will always have, approximately, $$ \bar{X}_n \sim N(\mu, \frac{\sigma^2}{n}). $$
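The CLT can be checked by simulation with a decidedly non-normal population. A minimal sketch (in Python; an exponential distribution with mean 2, so `\(\sigma^2 = 4\)` — the numbers are illustrative):

```python
import random
import statistics

random.seed(2)

mu, var = 2.0, 4.0  # exponential with mean 2 has variance 4
n = 50              # observations per sample

# Sampling distribution of the sample mean across many samples
means = [statistics.mean(random.expovariate(1 / mu) for _ in range(n))
         for _ in range(5_000)]

print(statistics.mean(means))      # close to mu = 2
print(statistics.variance(means))  # close to var / n = 0.08
```

Even though each `\(X_i\)` is skewed, a histogram of `means` looks approximately normal, centered at `\(\mu\)` with variance `\(\sigma^2/n\)`.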